Abstract

The carcinoembryonic antigen (CEA) gene family belongs to the immunoglobulin (Ig) superfamily and codes for a vast number of 
glycoproteins that differ greatly both in amino acid composition and function. The CEA family is divided into two groups, the 
carcinoembryonic antigen-related cell adhesion molecules (CEACAMs) and the pregnancy-specific glycoproteins. The CEA family 
members are implicated in pleiotropic (patho)physiological functions including cell-cell adhesion, pregnancy, immunity, neovascularization,
regulation of insulin homeostasis, and carcinogenesis. In general, the CEA-encoded proteins are composed of an extracellular
region with Ig variable and constant-like domains and a cytoplasmic region containing signaling motifs. Of particular interest,
the well-studied human and mouse CEA genes are arranged in clusters on a single chromosome. Taking this characteristic into account,
we made an effort to reconstruct the evolutionary history of the CEA gene family. Toward this end, the publicly available
genomes were searched extensively for CEA homologs. The domain organization of the retrieved protein sequences was analyzed,
and, subsequently, comprehensive phylogenetic analyses of the full-length CEA homologous proteins were performed. A series of
evolutionarily conserved, functionally important amino acid residues were identified. The relative positioning of these residues on the
modeled tertiary structure of novel CEA protein domains revealed that they are also spatially conserved. Furthermore, the chromosomal
arrangement of CEA genes was examined, and it was found that the CEA genes are preserved in terms of position,
transcriptional orientation, and number in all species under investigation.
Introduction

The carcinoembryonic antigen (CEA) gene family, which belongs
to the immunoglobulin (Ig) gene superfamily, comprises
an exceptionally diverse array of highly glycosylated glycoproteins
(Paxton et al. 1987; Zhou et al. 2001). The CEA family is
broadly divided into two groups, the CEA-related cell adhesion
molecules (CEACAMs) and the pregnancy-specific glycoproteins
(PSGs) (Hammarstrom 1999). In humans, based on our
current phylogenetic analysis, the CEA family consists of 35
genes, 21 of which are protein coding, arranged in contiguous
clusters on chromosome 19 in the region 19q13.2-19q13.4
(Hammarstrom 1999).

The CEA-encoded proteins vary in length and domain
organization, which probably reflects their functional divergence.
All currently reported CEA-encoded proteins consist
of at least one Ig variable (IgV)-like domain, followed by a
varying number of Ig constant (IgC)-like domains
(Brummendorf and Rathjen 1995). The core structure of
these domains, the Ig-like fold, is characterized by two
β-sheets (faces) that cross over each other. The IgV-like
domain, approximately 110 amino acids long, contains a conserved
basic (arginine) and an acidic (aspartate) amino acid,
which are proposed to stabilize the Ig-like fold via an intradomain
salt bridge. The CFG face of the IgV-like domain (named
after the C, F, and G strands of which it is composed) mediates homotypic
and heterotypic cell-cell adhesion (Taheri et al. 2000). The
IgC-like domain contains two conserved cysteine residues
that occupy the positions corresponding to the arginine and
aspartate of the IgV-like domain and stabilize the Ig-like
conformation by forming a disulfide bridge
(Williams and Barclay 1988; Bork et al. 1994).

CEACAM genes are expressed in a wide variety of cell types 
including epithelial, endothelial, and immune cells such as leu- 
kocytes and dendritic cells, whereas PSGs are expressed exclu- 
sively in the placental trophoblasts (Hammarstrom 1999). 
CEACAMs are either inserted into the cell membrane via a
transmembrane (TM) domain or linked to the membrane
via a glycosylphosphatidylinositol (GPI)
anchor (Naghibalhossaini et al. 2007). The latter type of
membrane anchorage has been detected only in primates
thus far. The TM-anchored CEACAMs possess a C-terminal
cytoplasmic domain, which may contain motifs associated
with signal transduction (Hammarstrom 1999).

Members of the CEA family are implicated in diverse phys- 
iological and pathological functions (Obrink 1997; Kuespert 
et al. 2006). For instance, CEACAMs play a vital role during 
embryonic development where cell-cell adhesion is necessary 
to integrate the cells into functional organs (Kuespert et al. 
2006). Members of the CEACAM group also serve as receptors
for several bacterial and viral pathogens, such as the
murine hepatitis virus, Haemophilus influenzae, Neisseria
meningitidis, and N. gonorrhoeae, which bind the N-terminal
IgV-like domain of CEACAM proteins (Bos et al. 1999; Virji et al.
1999, 2000; Villullas et al. 2007). PSGs are proteins secreted
by fetal trophoblasts, which are proposed to regulate
maternal-fetal interactions during pregnancy (Hau et al. 1985;
Ha et al. 2010). Of particular note, CEA family members play an
important role in carcinogenesis (Scorilas et al. 2003; Michaelidou et al.
2013). The prototypic member of this family, human CEA
(henceforth referred to as CEACAM5), was discovered by
Gold and Freedman (1965) in the blood of
patients with colon cancer. CEACAM5 is consistently overexpressed
in various malignancies and is frequently associated with
poor clinical outcome and reduced overall survival
(Chevinsky 1991). These properties have made CEACAM5 a
prominent clinical cancer biomarker, widely used in the early
diagnosis, prognosis, and monitoring of colon cancer,
as well as other types of cancer (Gaglia et al. 1988; Ballesta
et al. 1995).

In this study, we made an effort to reconstruct the evolutionary
history of the CEA gene family and identify conserved
amino acids that may play an important role in the overall structure
and function of CEA proteins. To this end, the fully
sequenced and nearly completely sequenced genomes
were searched for CEA homologs. Members of the CEA
family were identified in diverse taxa covering an evolutionary 
range from cartilaginous fishes to human. Subsequently, com- 
prehensive phylogenetic analyses were performed employing 
the maximum likelihood (ML) and the neighbor-net methods. 
The genomic arrangement of the identified CEA-related genes 
was also analyzed, and it was shown that in different species, 
these genes are arranged in contiguous clusters with con- 
served position and orientation of transcription. On the basis 
of both the phylogenetic and syntenic analyses, we identified 
eight conserved gene clusters. Furthermore, the protein 
domain organization of the CEA homologs was examined 
and amino acid conservation patterns were identified. The
three-dimensional (3D) structure of domains from species of
basal taxa was predicted with homology modeling,
and the evolutionarily conserved amino acids were mapped
onto these structures.

Materials and Methods 

Sequence Database Searching 

The names or accession numbers of the characterized CEA family
members reported in the literature were used initially to retrieve their
corresponding sequences from the publicly available nonredundant
sequence databases ENSEMBL (Flicek et al. 2013), National
Center for Biotechnology Information (NCBI)'s RefSeq (Pruitt
et al. 2012), and UniProtKB (Magrane and UniProt
Consortium 2011). To obtain more putative CEA homologs,
these sequences were subsequently used as probes to perform
extensive reciprocal BLASTp and tBLASTn (Altschul et al. 1990)
searches of genomes with high coverage (>6x) and low coverage
(2x). This process was iterated until convergence, that
is, until no novel putative CEA sequences could be detected. For
each gene, the longest known transcript was selected. Partial or
ambiguous sequences were excluded from the subsequent steps of
the study. The Translate program (http://web.expasy.org/translate/,
last accessed May 27, 2014) was used to translate
nucleotide sequences.
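
The iterate-until-convergence search described in this section can be pictured as a fixed-point loop. The following is a minimal sketch under stated assumptions, not the pipeline used in the study: `search_fn` is a hypothetical stand-in for one round of BLASTp/tBLASTn plus the reciprocal-hit filter, and the toy hit table uses invented identifiers purely for illustration.

```python
from typing import Callable, Iterable, Set


def search_until_convergence(
    seeds: Iterable[str],
    search_fn: Callable[[str], Set[str]],
) -> Set[str]:
    """Reuse every newly found sequence as a probe until a round
    returns no sequence that is not already known."""
    known = set(seeds)
    probes = set(known)
    while probes:
        hits: Set[str] = set()
        for probe in probes:
            hits |= search_fn(probe)
        probes = hits - known  # only novel sequences become new probes
        known |= probes
    return known


# Toy hit table (hypothetical identifiers, not real accessions).
toy_hits = {
    "seqA": {"seqB"},
    "seqB": {"seqA", "seqC"},
    "seqC": {"seqB"},
}
found = search_until_convergence({"seqA"}, lambda q: toy_hits.get(q, set()))
print(sorted(found))
```

The loop stops exactly when a round contributes nothing new, which mirrors the convergence criterion stated in the text; in practice the search function would also apply E-value and coverage thresholds, which are not specified here.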

Motifs Construction 

Representative CEA peptide sequences were aligned with
MAFFT v.7 (Katoh and Standley 2013, 2014) and edited
with the Utopia suite's CINEMA alignment editor (Pettifer et al.
2009). Sequence motifs were excised from the multiple sequence
alignments and manually edited for insertions or gaps.
They were then submitted to WebLogo3 (Crooks et al. 2004)
with default options to generate consensus sequences.
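
The stack heights in a sequence logo reflect the information content of each alignment column. As a rough illustration of that quantity, the sketch below computes log2(20) minus the Shannon entropy for a single protein column. It is an assumption-laden simplification: gaps are ignored and no small-sample correction is applied, so the values may differ from those WebLogo reports; the function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column:
    log2(alphabet size) minus the Shannon entropy of its residues.
    Gap characters are ignored; no small-sample correction is applied."""
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def letter_heights(column: str, alphabet_size: int = 20) -> dict:
    """Height of each letter: its frequency times the column information."""
    residues = [c for c in column.upper() if c not in "-."]
    info = column_information(column, alphabet_size)
    return {aa: info * n / len(residues) for aa, n in Counter(residues).items()}


# An invariant column carries the maximum information; a mixed one carries less.
print(column_information("RRRR"), column_information("RKRK"))
print(letter_heights("RRRK"))
```

An invariant position, such as the conserved arginine or aspartate of the IgV-like domain, would reach the maximum of log2(20) bits under this simplified definition.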

Chromosomal Localization 

The chromosomal localization of the CEA genes was deter- 
mined using the ENSEMBL GeneView (Flicek et al. 2013) and 
the NCBI MapViewer (Wolfsberg 2011).

Alignments and Phylogenetic Analyses 

The full-length CEA amino acid sequences were aligned with 
MAFFT v.7. The resulting multiple sequence alignments were 
edited using CINEMA alignment editor (Pettifer et al. 2009). 
The trimmed alignments were then used to reconstruct phylogenetic
trees by employing two separate methods. To
obtain ML-based trees, the method implemented in the software
package MEGA, version 5.2 (Tamura et al. 2011) was
used. A distance-based tree (BIONJ) (Gascuel
1997) was used as the starting tree, the tree space was searched
with the nearest-neighbor-interchange heuristic, and rate
heterogeneity among sites was modeled with five discrete gamma
categories. The number of amino acid substitutions
per position was estimated with the JTT model (Jones et al.
1992). Trees were also reconstructed employing the neighbor-net
method (Bryant and Moulton 2004) implemented in
SplitsTree v.4 (Huson 1998; Kloepper and Huson 2008), a
distance-based method able to detect conflict between phylogenetic
signals in the form of networks; the uncorrected P
model of substitution was used. For both methods, bootstrap
analyses (200 pseudoreplicates) were conducted to evaluate
the statistical support for the reconstructed trees. The trees
generated with the ML method were illustrated with
Dendroscope v.3 (Huson and Scornavacca 2012).

Pavlopoulou and Scorilas. Genome Biol. Evol. 6(6): 1314-1326. doi:10.1093/gbe/evu103 Advance Access publication May 23, 2014
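
Two ingredients of the analyses in the Alignments and Phylogenetic Analyses section lend themselves to a small worked example: the uncorrected p-distance and the column resampling behind a bootstrap pseudoreplicate. The sketch below is illustrative only. Skipping gapped positions pairwise is an assumption rather than a documented setting of the study, the toy alignment is invented, and tree inference itself is left to the phylogenetics software.

```python
import random
from typing import Dict


def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance: proportion of differing sites among
    the aligned positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in compared) / len(compared)


def bootstrap_replicate(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping the length fixed."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Toy alignment (invented sequences).
toy = {"sp1": "MKTAY", "sp2": "MKSAY", "sp3": "MR-AF"}
rng = random.Random(0)  # fixed seed so the resampling is reproducible
replicates = [bootstrap_replicate(toy, rng) for _ in range(200)]
print(p_distance(toy["sp1"], toy["sp2"]), len(replicates))
```

Each pseudoreplicate would then be passed through the same tree-building procedure, and the support for a node is the percentage of replicate trees in which that grouping reappears.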

Protein Domain Organization 

The consensus boundaries of the individual protein domains in
CEA proteins were determined from the full-length CEA
amino acid sequences by combining the outputs of the search
engines available in the SMART v.7 (Letunic et al. 2012), PFAM
v.27 (Punta et al. 2012), CDD v.3 (Derbyshire et al. 2012),
and InterPro v.42 (Hunter et al. 2012) protein signature databases.
The TM regions were predicted with the programs
MINNOU (Cao et al. 2006) and PRED-TMR2 (Pasquier et al.
1999).

Homology Modeling 

The 3D structures of the IgV-like domain of MedakaCea,
FrogCea7, and LizardCeacam19, and the 3D structures of
the IgC-like domain of FrogCea2 and PlatypusCea1
(target proteins) were predicted by homology modeling. The
X-ray crystal structures of the murine Ceacam1a (PDB ID:
1L6Z) (Tan et al. 2002) and the human CEACAM1 (PDB ID:
2GK2) (Fedarovich et al. 2006) were used as templates to
model the IgC-like and IgV-like domains, respectively, with
the modeling package Modeller (Sali et al. 1995). To remove
any local constraints, the generated protein models were subjected
to energy minimization using the Charmm27 force field,
implemented in Gromacs v.4.5.5 (Hess et al. 2008). The quality
of the final modeled protein structures was evaluated using
Procheck (Laskowski et al. 1996) and ANOLEA (Melo et al.
1997). The protein models were illustrated with PyMol
(DeLano 2002). Furthermore, the secondary structure of the
TM domain of the sequences ZebrafishCea1, FrogCea7,
LizardCeacam19, and HumanCEACAM1 was predicted
using the bioinformatics tools described in Pavlopoulou and
Michalopoulos (2011). The predicted TM helices were modeled,
template free, as described above.

Results 

Identification of CEA Homologs 

In this study, we performed comprehensive and updated phylogenetic
analyses of the CEA homologs in the available genomes
of 33 species: Anolis carolinensis (lizard), Bos taurus
(cow), Branchiostoma lanceolatum (amphioxus), Callithrix jacchus
(marmoset), Canis familiaris (dog), Ciona intestinalis
(ascidia), Danio rerio (zebrafish), Dasypus novemcinctus (armadillo),
Drosophila melanogaster (fruit fly), Equus caballus
(horse), Gallus gallus (chicken), Homo sapiens (human),
Latimeria chalumnae (coelacanth), Lepisosteus oculatus (spotted
gar), Leucoraja erinacea (little skate), Loxodonta africana
(elephant), Macaca mulatta (macaque), Microcebus murinus
(mouse lemur), Monodelphis domestica (opossum), Mus musculus
(mouse), Myotis lucifugus (microbat), Ornithorhynchus
anatinus (platypus), Oryzias latipes (medaka), Otolemur garnettii
(bushbaby), Pan troglodytes (chimpanzee), Pelodiscus
sinensis (Chinese softshell turtle), Petromyzon marinus (lamprey),
Pongo abelii (orangutan), Rattus norvegicus (rat),
Taeniopygia guttata (zebra finch), Takifugu rubripes (pufferfish),
Tupaia belangeri (tree shrew), and Xenopus tropicalis
(frog). Genomes with high coverage were preferred to
avoid the underestimation of the number of CEA genes that can
occur with low-coverage genomes. Collectively, 207 CEA
protein-encoding genes, 13 pseudogenes, and 1 expressed sequence
tag (EST) sequence were identified in the genomes of
20 species representing diverse eukaryotic taxonomic divisions
(according to the NCBI taxonomy database [Federhen 2012])
(supplementary table S1, Supplementary Material online):
primates (87), rodentia (47), perissodactyla (9), cetartiodactyla
(6), carnivora (10), afrotheria (5), xenarthra (7), metatheria
(15), prototheria (5), sauria (2), amphibia (13), teleosts
(14), and chondrichthyes (1). Despite extensive database
searches, CEA homologs were not detected in the complete
and well-annotated genomes of aves and insects or in basal
chordates (lamprey, amphioxus, and ascidia).

To prevent confusion, we used the revised nomenclature
of Beauchemin et al. (1999) for human and rodent CEA sequences;
for the rest, we used the names provided in the
original references and the sequence databases. The
newly identified sequences (e.g., primate CEA)
were named according to their homology to the most closely related
well-annotated human and mouse CEA genes. The distant
homologs (e.g., frog and fish CEA), with no significant sequence
similarity to the known CEA, were given names beginning with
CEA followed by an ascending number according to their
order on the chromosome.
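
The naming rule for distant homologs can be expressed in a few lines. This is only a sketch of the convention as stated above: the gene identifiers and coordinates are invented, the species-plus-"Cea" label format is inferred from names such as FrogCea7 used in the text, and all genes are assumed to lie on one chromosome or scaffold.

```python
from typing import Dict, List, Tuple


def name_distant_homologs(
    genes: List[Tuple[str, int]],
    species: str,
    prefix: str = "Cea",
) -> Dict[str, str]:
    """Assign ascending numbers to genes in order of their start coordinate.

    `genes` holds (identifier, start position) pairs for one chromosome."""
    ordered = sorted(genes, key=lambda item: item[1])
    return {
        gene_id: f"{species}{prefix}{rank}"
        for rank, (gene_id, _start) in enumerate(ordered, start=1)
    }


# Hypothetical gene identifiers and start coordinates.
toy_genes = [("g1", 52_000), ("g2", 11_500), ("g3", 98_300)]
names = name_distant_homologs(toy_genes, "Frog")
print(names)
```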

Conserved Structural Features of CEA Proteins 

The CEA homologous proteins were found to differ greatly in
their length and domain organization. On the basis of the
combined output of the signature databases and the multiple
alignment of CEA protein sequences, we determined the organization
of the three major protein domains of the CEA proteins,
namely IgV-like, IgC-like, and
TM, the immunoreceptor tyrosine-based activation motif
(ITAM) and immunoreceptor tyrosine-based inhibition motif
(ITIM) in the cytoplasmic region, and the GPI anchors.
Furthermore, consensus protein motifs were derived from
the multiple alignment of sequences that correspond to these
three domains and the cytoplasmic domain, and a
number of conserved amino acid residues were identified
(figs. 1-4).

Fig. 1. IgV-like domain. (A) Homology models of the IgV-like domain of MedakaCea (pink), FrogCea7 (green), and LizardCeacam19 (blue) in cartoon
representation superposed on the human CEACAM1 (PDB ID: 2GK2) (gray). The residues arginine and aspartate that form a salt bridge are shown as a ball-and-stick
representation. (B) The conserved protein motifs derived from the IgV-like domain. The amino acid residue numbers (according to human
CEACAM1) are indicated. The invariant residues arginine and aspartate are indicated by dots. The letters, representing amino acid residues of the motif
sequences, are piled one on top of another at every position in the sequences. The height of each letter is proportional to the frequency of the corresponding
amino acid at that position; the letters are ordered, so the most frequent one is on the top. The height of the whole pile is normalized, so that it indicates the
information content (measured in bits) in each position.

Given that the 3D structure of a protein is more conserved
than its corresponding amino acid sequence, an effort was
made to map the position of these residues to the tertiary
structure of representative CEA domains. Toward this end,
the 3D structures of the IgV-like (fig. 1) and IgC-like (fig. 2)
domains from putative, evolutionarily diverse CEA proteins
were predicted with homology modeling using the resolved
crystal structures of the human
CEACAM1 (PDB ID: 2GK2) and the murine Ceacam1a (PDB
ID: 1L6Z) as templates, respectively.

As shown in the superimposed structures of the 
MedakaCea, FrogCea7, and LizardCeacam19 IgV-like do- 
mains (fig. 1), the major secondary structures are conserved. 
The residues arginine (R64) and aspartate (D82) that form a 
salt bridge are also conserved in the modeled protein struc- 
tures. Also, the amino acid asparagine (N70), suggested to be 
involved in glycosylation, was found to be highly conserved. 
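
Whether a modeled arginine-aspartate pair is geometrically compatible with a salt bridge can be checked from atomic coordinates. The sketch below is a generic geometric test, not the procedure used in the study: the 4.0 Å cutoff is a commonly used criterion adopted here as an assumption, the coordinates are invented, and in a real model they would be read from the NE, NH1, and NH2 atoms of the arginine and the OD1 and OD2 atoms of the aspartate.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def min_distance(atoms_a: Sequence[Coord], atoms_b: Sequence[Coord]) -> float:
    """Smallest Euclidean distance (in the units of the input) between two atom sets."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def is_salt_bridge(
    arg_nitrogens: Sequence[Coord],
    asp_oxygens: Sequence[Coord],
    cutoff: float = 4.0,  # assumed distance criterion in angstroms
) -> bool:
    """True if any Arg side-chain nitrogen lies within `cutoff` of an Asp carboxylate oxygen."""
    return min_distance(arg_nitrogens, asp_oxygens) <= cutoff


# Invented coordinates for illustration only.
arg_n = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
asp_o_near = [(3.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
asp_o_far = [(10.0, 0.0, 0.0)]
print(is_salt_bridge(arg_n, asp_o_near), is_salt_bridge(arg_n, asp_o_far))
```

The same minimum-distance function could be applied to the sulfur atoms of two cysteines when inspecting a modeled disulfide bridge, with a correspondingly shorter cutoff.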

As shown in figure 2, the modeled 3D structures of the IgC-like
domain of FrogCea2 and PlatypusCea1 superimposed
onto the N-terminus of the murine Ceacam1a exhibit notable
similarity in their secondary structure elements. The two invariant
cysteine residues, which are involved in the formation of
the disulfide bridge, were found to be spatially conserved in
the IgC-like domain of the CEA homologs (fig. 2).

The TM domains of the homologous CEA proteins were
predicted to adopt an α-helical conformation (fig. 3). Two
main signature motifs were also identified in the TM domain.

In the cytoplasmic region, consensus ITAM and ITIM were
identified, in which the tyrosine residue is invariant (fig. 4).
Phosphorylation of ITAM/ITIM initiates or terminates, respectively,
signal transduction pathways implicated in cellular proliferation
(Beauchemin et al. 1997) or regulation of the immune
response.

Fig. 2. IgC-like domain. (A) Modeled protein structures of the IgC-like domain of FrogCea2 (green) and PlatypusCea1 (brown) superimposed onto the
murine Ceacam1a (PDB ID: 1L6Z) (purple). The cysteines involved in the formation of the disulfide bridge are shown as a ball-and-stick representation, and
the disulfide bridges are indicated by yellow lines. (B) The conserved protein motifs detected in the IgC-like domain, numbered according to murine
Ceacam1a. The invariant cysteine residues are denoted by dots.

Fig. 3. TM domain. (A) Modeled TM helices of HumanCEACAM1
(purple), LizardCeacam19 (gold), FrogCea7 (green), and ZebrafishCea1
(red). (B) The conserved TM protein motifs (Motif 1 and Motif 2); the amino acid numbering is based
on HumanCEACAM1.

Syntenic Mapping of CEA Homologous Genes

The chromosomal arrangement of the CEA homologous
genes found in the genomes of all species under study was
investigated. As shown in figure 5, the homologous CEA
genes are arranged in clusters, with conserved sequential
order, transcriptional orientation, number, and flanking
genes, in all species under investigation, at least in the
high-coverage genomes. We identified eight (I-VIII) conserved
gene clusters in our study, which are indicated by roman numerals
according to the order of their appearance
in the evolutionary timetable (fig. 5). The "ancestral"
Cluster I appeared first in the common ancestor of extant
amniotes and contains the genes CEACAM20, CEACAM19,
and CEACAM16. Cluster II emerged for the first time in the common
ancestor of extant eutherians and contains a
single gene, CEACAM18. Cluster III appeared for the first time in the
common ancestor of euarchontoglires and laurasiatheria
(CEACAM21). Subsequent duplications of CEACAM21
have apparently given rise to CEACAM3-7 in the primate lineage.
Clusters IV-VII are restricted to the Glires lineage and
more specifically to rodents. Cluster VIII was detected only in
primates and is therefore considered primate specific. Of particular note,
the CEA homologs of the New World monkeys are localized
on chromosome 19. Because of incomplete genome assemblies,
the CEA genes of several organisms were detected in
chromosomal fragments. Therefore, in this study, a CEA
member was considered to be absent only if the gene was
not detected and the CEA genes that flank it in the prototypic
human and mouse sequential order were detected in the same
chromosome, scaffold, or contig (fig. 5).
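
The criterion for calling a gene absent can be written as a small decision rule. The sketch below encodes one reading of that criterion and rests on assumptions: "flanking" is taken to mean the immediate neighbors in a reference gene order, genes at either end of the reference order are left undetermined, and the gene names and scaffold labels are placeholders rather than data from the study.

```python
from typing import Dict, List


def is_called_absent(
    gene: str,
    reference_order: List[str],
    detected: Dict[str, str],
) -> bool:
    """Return True only if `gene` was not detected and both of its
    neighbors in the reference order were detected on the same
    chromosome, scaffold, or contig.

    `detected` maps each detected gene to the sequence it was found on."""
    if gene in detected:
        return False
    idx = reference_order.index(gene)
    if idx == 0 or idx == len(reference_order) - 1:
        return False  # only one flank available: left undetermined here
    left, right = reference_order[idx - 1], reference_order[idx + 1]
    if left not in detected or right not in detected:
        return False
    return detected[left] == detected[right]


# Placeholder reference order and detections.
order = ["geneA", "geneB", "geneC", "geneD"]
same_scaffold = {"geneA": "Scf1", "geneC": "Scf1"}
split_scaffolds = {"geneA": "Scf1", "geneC": "Scf2"}
print(is_called_absent("geneB", order, same_scaffold))
print(is_called_absent("geneB", order, split_scaffolds))
```

A missing gene whose neighbors sit on different scaffolds is not called absent under this rule, because the gene could lie in the unassembled gap between them.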

Fig. 4. Cytoplasmic domain. ITAM and ITIM. The invariant tyrosine residues are indicated by dots, numbered according to HumanCEACAM19 and
MouseCeacam1, respectively.

Fig. 5. Schematic depiction of the chromosomal arrangement of CEA genes. The orientation of transcription and approximate position and size of CEA
genes are indicated. The genomic boundaries of each chromosome/scaffold/contig are shown in parentheses. The CEA protein encoding genes are shown as
filled arrowheads, and the CEA pseudogenes are indicated by open arrowheads. The non-CEA genes flanking the CEA genes are shown in dark red. The CEA
gene clusters are indicated by roman numerals and different coloration. An NCBI-derived cladogram illustrating the evolutionary relationships of the taxa
under study is shown on the left. Chr: chromosome; Scf: scaffold.

Phylogenetic Analyses

To investigate the evolutionary relationships among CEA family members,
comprehensive phylogenetic analyses based on the full-length
protein sequences of all species under study were
conducted. Two different methods for phylogenetic
reconstruction, ML and neighbor-net, were employed to better
resolve the evolutionary relationships. The trees generated
with both methods are congruent, as their overall
topology is similar (fig. 6 and supplementary fig. S1,
Supplementary Material online). Representative CEA sequences
from selected species with complete or almost complete
genomes were chosen for a more accurate phylogenetic analysis,
using both tree construction methods (fig. 7 and
supplementary fig. S2, Supplementary Material online). The
low support values (below 50) at some nodes suggest alternative
branching patterns.

Fig. 6. ML radial cladogram of CEA proteins. The sequences are represented by the species name and the CEA protein name. The branches are colored
according to the eukaryotic taxa. CEACAM16, CEACAM18, CEACAM19, CEACAM20, and CEACAM21 are highlighted by different shading.

The CEACAM16, CEACAM18, CEACAM19, and
CEACAM20 homologs form their own distinct clades with
representatives from almost every taxonomic division (figs. 6
and 7, supplementary figs. S1 and S2, Supplementary Material
online). CEACAM19 and CEACAM20 appear to be the primordial
genes of the CEA family (fig. 5) because they were found
in the common ancestor of amniotes. However, neither
CEACAM19 nor CEACAM20 orthologs were identified in prototheria
(platypus) (fig. 5); this is probably due to incomplete
genome assemblies. CEACAM16 was detected for the first time in
prototheria, whereas CEACAM18 appeared later in evolution
in the common ancestor of extant eutherians. CEACAM21
orthologs also form a coherent clade (figs. 6 and 7, supplementary
figs. S1 and S2, Supplementary Material online).
Interestingly, CEACAM21, which was detected first in the
order perissodactyla and then again in the superfamily of
apes (fig. 5), is restricted to three species, namely human,
orangutan, and horse. This finding raises the possibility
that either a CEACAM21 gene existed in other species
and was deleted during the course of evolution, or
CEACAM21 evolved independently within these three species.

CEACAM1 and CEACAM3-8 were found only in primates (fig. 5). In
particular, CEACAM5, CEACAM7, and CEACAM8 appeared
for the first time in the common ancestor of New World monkeys (marmoset),
whereas CEACAM1, CEACAM3, CEACAM4,
and CEACAM6 arose later in Old World monkeys (macaque)
(fig. 5). CEACAM1 and CEACAM3-8 appear to form separate
monophyletic branches (albeit moderately supported) (figs. 6
and 7, supplementary figs. S1 and S2, Supplementary Material
online), suggesting that primate-specific duplications of the
CEACAM1 and CEACAM3-8 genes have taken place.
The domain organization of CEACAM1 and CEACAM3-8 is also preserved
among species (fig. 7). The GPI anchor was detected only in
primate CEACAM5-8, corroborating previous
reports (Naghibalhossaini and Stanners 2004).

Based on the phylogenies (figs. 6 and 7, supplementary 
figs. S1 and S2, Supplementary Material online), the PSG 
genes of primates and the corresponding Psg genes of rodents 
form two different monophyletic branches, leading to the 
suggestion that PSG and Psg genes have likely expanded 
after the divergence of primates and rodents. Given that the 
PSG protein sequences of apes cluster with the corresponding 
PSGs of the fellow apes and not with the PSGs of their own 
species (e.g., human, chimpanzee, and orangutan PSG3) (figs. 
6 and 7, supplementary figs. S1 and S2, Supplementary 
Material online) along with the observation that both their 
length and domain organization are different (fig. 7), we spec- 
ulate that the PSGs of apes were derived from duplication 
events that have presumably preceded the speciation of 
apes. On the other hand, the PSGs of the New World monkeys 
form a subclade within the PSG clade (fig. 6 and supplemen- 
tary fig. S1, Supplementary Material online), suggesting that 
they have rather evolved independently of those in apes. 

As opposed to primate PSGs, a series of species-specific
gene duplications must have occurred in rodents, yielding 11
Psg paralogs (Cluster VI) (fig. 5) in mouse, which share significant
sequence similarity (figs. 6 and 7 and supplementary figs.
S1 and S2, Supplementary Material online). Regarding the
domain organization, mouse Psg proteins harbor three IgV-like
domains, whereas primate PSGs possess only one (fig. 7).
In the rodent-specific Cluster VII (fig. 5), Ceacam1 and
Ceacam2 are likely the products of a tandem gene duplication
subsequent to the mouse-rat divergence because mouse
Ceacam1 and Ceacam2 cluster together with high confidence
(figs. 6 and 7, supplementary figs. S1 and S2, Supplementary
Material online). However, this is not the case for the genes
located in Clusters IV and V (fig. 5), which appear to have
expanded prior to rodent speciation (figs. 6 and 7, supplementary
figs. S1 and S2, Supplementary Material online).

The Cea homologs that were identified in zebrafish and
pufferfish are located in two different chromosomal fragments
flanked by the Grik5 co-orthologs, Grik5(1) and
Grik5(2) (fig. 5). Given that a whole-genome duplication occurred
in teleost fishes subsequent to their divergence from
nonteleost ray-finned fishes, approximately 320-400 Ma
(Hoegg et al. 2004; Jaillon et al. 2004; Meyer and Van de
Peer 2005; Kasahara et al. 2007), it would be reasonable to
suggest that the zebrafish and pufferfish Cea are probably the
products of this teleost-specific duplication.

According to the phylogenetic trees (figs. 6 and 7, supple- 
mentary figs. S1 and S2, Supplementary Material online), the 
teleost Cea protein sequences cluster in a well-supported 
monophyletic clade (with a bootstrap value of 91) (fig. 7). 
Therefore, the Cea genes detected in the contemporary 
teleost genomes must have been derived from a series of lin- 
eage-specific duplications, as in the case of amphibian, pro- 
totherian, metatherian, and specific therian CEA-related 
genes. This hypothesis is also supported by the relatively 
large evolutionary distances and the diverse domain organiza- 
tion of the proteins encoded by the above genes (fig. 7). 

Discussion 

Several experimental studies have focused on the expansion of
CEA in specific species or taxa (Zhou et al. 2001; McLellan
et al. 2005; Zebhauser et al. 2005; Weichselbaumer et al.
2011). In a more recent experimental effort, several
CEA-related genes were also detected in vertebrates (Chang
et al. 2013). The availability of a growing number of sequenced
genomes enabled us to perform, for the first time,
comprehensive phylogenetic and structural analyses of CEA.
In this study, CEA members were identified in organisms from
different taxonomic divisions, ranging from cartilaginous
fishes to humans. An EST sequence was detected in little
skate (L. erinacea), which was found to be a CEA homolog
based on BLAST searches. This allowed us to trace the evolutionary
origin of CEA to approximately 450-420 Ma, when chondrichthyes
emerged (Venkatesh et al. 2014). A large number
of CEA members were detected in teleosts, frog, platypus,
opossum, elephant, armadillo, dog, and horse with no
close homologs in other species, suggesting
lineage/species-specific gene amplification. PSG homologs were
detected exclusively in the superorder of euarchontoglires (primates
and rodents), which have hemochorial placentae
(Carter and Enders 2004), and not in other mammalian
orders with different types of placentation, such as epitheliochorial
or endotheliochorial (Zeiler et al. 2007). On the basis of
this finding, we suggest that PSGs have expanded after
the radiation of euarchontoglires to perform functions related
to the hemochorial mode of placentation.

Subsequently, phylogenetic reconstructions were per- 
formed with the entire length of the CEA-encoded proteins 
to include all the available evolutionary information that is 
present in the amino acid sequences. In this way, a series of 
sequentially and spatially conserved amino acids were also 
identified in the IgV- and IgC-like domains. The conservation 
of these residues across the diverse CEA family members sug- 
gests the importance of these residues in the overall structure 
and function of CEA. 

In this study, the chromosomal arrangement of the CEA
homologs in all species under investigation was examined. A
prominent feature of the CEA gene family is that it consists of
clusters of genes with conserved order and orientation,
mapped to a single chromosome, in all species. On the basis
of both syntenic and phylogenetic analyses, we identified a
total of eight (I-VIII) conserved gene clusters, the first one
appearing for the first time in the common ancestor of amniotes.
Moreover, the flanking non-CEA genes such as Grik5, PVR,
SIGLECs, and IGSF23 are also conserved with the same order
and position in all organisms under study. Given that shared
synteny is likely associated with function (Wang et al. 2008),
we suggest that these genes may have evolved along with
CEA to complement CEA's function.

Fig. 7. ML phylogram of representative CEA proteins. For clarity, the mouse Psg clade (Cluster VI) is condensed and shown separately. Bootstrap values
greater than 50% are indicated at the nodes. The branch lengths depict evolutionary distance. The domain organization of CEA proteins is presented on the
right of the corresponding sequences. The domain legends are shown in the figure inset. The scale bar at the upper left indicates the number of amino acid
substitutions per position. The conventions are the same as in figure 6.

The CEA gene family represents a notable example of gene
duplication, a process suggested to be essential for the development
of novel genes (Demuth et al. 2006). The extensive
presence of duplicated genes such as kallikreins (Pavlopoulou
et al. 2010), bitter taste receptor (T2R) genes, the mammalian
lysozyme gene family (Dong et al. 2009), genes encoding
keratin-associated proteins (KRTAPs) (Wu et al. 2008),
and the oxidative phosphorylation (OXPHOS) gene families
(De Grassi et al. 2008), all of which are implicated in important
physiological processes, underscores the importance of this process.
We propose that successive rounds of gene duplication,
followed by deletions, inversions, translocations, and divergence,
have likely given rise to the CEA genes found in
contemporary genomes.

Of particular note, both ITAM and ITIM were detected together
in human, orangutan, mouse, horse, and cow
CEACAM proteins. This observation suggests
that evolutionary pressure may have driven ITAM and
ITIM, motifs that exert opposing signaling effects (activating
vs. inhibitory), to coevolve (Kammerer and Zimmermann
2010). In particular, recognition of bacterial pathogens by
CEACAM3 results in phosphorylation of its ITAM by protein
kinases of the Src family; in turn, a signal transduction pathway
is initiated that leads to bacterial engulfment and killing
(Hauck et al. 1998; McCaw et al. 2003). On the other hand,
the ITIM of CEACAM1 was shown to be
essential for suppressing the adaptive immune response upon
infection by bacteria of the genus Neisseria (Boulton and Gray-Owen
2002). Moreover, CEACAM1, as opposed to CEACAM3, acts
as a tumor suppressor, shown to inhibit the growth of prostate,
colon, and breast tumors (Estrera et al. 2001; Volpert
et al. 2002; Sappino et al. 2012). The ITIM could presumably
account for CEACAM1's tumor-suppressive properties. In this
study, an ITAM was also detected in the cytoplasmic tail of
CEACAM19, which is overexpressed in several types of cancer
(Scorilas et al. 2003; Michaelidou et al. 2013). The oncogenic
potential of CEACAM19 may, at least partially, depend on the
presence of the ITAM. Further experimental studies are needed
to verify the signaling regulatory role of ITAM/ITIM in
various cellular activities.

We expect that the findings of our study could lay the
foundation for the design of experimental studies directed
toward the elucidation of the biochemical function of the
putative CEA genes and CEA-encoded proteins, taking into
consideration the identified protein patterns. The conserved amino
acids detected in the protein sequences could also represent
potential drug targets and should be considered
in the design of therapeutic agents in
anticancer research.

Supplementary Material 

Supplementary table S1 and figures S1 and S2 are available at
Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).